This study investigated the practice of weighting a type of item more than 
other types of items in computing student scores for a mixed-item-type 
test. The study used data from statewide writing field tests in grades 3, 5, 
and 8 and considered two contexts, that in which a single extended response 
writing prompt is "intentionally" or "purposefully" weighted twice in 
computing student scores (ER context) and one in which a set of constructed 
response items, including one extended response item, is intentionally 
weighted twice (CR context). The weighting option was compared with the use 
of no weighting. In either context, the criterion for the two options 
(weighting and no weighting with a shorter form) was the administration of 
twice as many items as the items that are deliberately weighted, combined 
with no use of weighting. The three options were compared in terms of student 
scores as well as raw-score-to-scale score conversion tables. The state uses 
number-correct scoring as opposed to pattern scoring. Either intentionally 
weighted or unweighted scores on the shorter form are, on average, very 
comparable to the criterion unweighted scores from the longer form. On the 
level of individual student scores, as compared with the unweighted 
shorter-form scores, more of the weighted shorter-form scores (2-5% more in 
the ER context and 1-9% more in the CR context) differ from the target scores 
by more than 10 points. The ramifications of the small decreases in 
individual score accuracy associated with purposeful weighting would depend 
on the purposes for which the scores are used and other factors. However, it 
is clear that purposeful weighting can never compensate for the loss of score 
accuracy caused by the shorter length of an actual test taken. Two appendixes 
contain a discussion of the item response models used in the study and a 
sample graph for one test item. (Contains 13 tables, 24 figures, and 12 
references.) (Author/SLD) 
Abstract 

The study investigated a fairly recent practice in some states that weights a type of item (e.g., 
constructed-response items) more than other types of items (e.g., selected-response items) to 
compute student scores for a mixed-item-type test. The study used data from statewide writing 
field-tests in three grades (3, 5, and 8) and considered two contexts: the context where a single 
extended-response writing prompt is “intentionally” or “purposefully” weighted twice in 
computing student scores (“ER context”), and the context where a set of constructed-response 
items, including one extended-response item, is intentionally weighted twice (“CR context”). 

The weighting option was compared against no use of such weighting. In either context, the 
criterion for the two options (weighting and no weighting with a shorter form) was the 
administration of twice as many items as the items that are deliberately weighted, combined with 
no use of weighting (no weighting with a longer form). 

The three options were compared in terms of student scores as well as raw-score-to-scale-score 
conversion tables. The state uses number-correct scoring as opposed to pattern scoring. Either 
intentionally weighted or unweighted scores on the shorter form are, on average, very 
comparable to the criterion unweighted scores from the longer form. On the level of individual 
student scores, as compared with the unweighted shorter-form scores, more of the weighted 
shorter-form scores (2-5% more in the ER context, and 1-9% more in the CR context) differ 
from the target scores by more than ten points. The ramifications of the small decreases in 
individual score accuracy associated with purposeful weighting would depend on the purposes 
for which the scores are used and other factors. However, it is clear that purposeful weighting 
can never compensate for the loss of score accuracy caused by the shorter length of an actual test 
taken. 
Introduction 

Although weighting composite scores to obtain a weighted sum has been done for years (e.g., a 
weighted sum of verbal and quantitative scores), deliberately weighting one type of item in a test 
more than others is a recent practice. This practice has arisen in some assessment programs as 
more testing programs adopt tests containing mixed item types, such as selected-response (SR), 
short constructed-response (CR), and extended-response (ER) types. One of the motivations for 
the practice seems to be the desire to allow open-ended items to have a greater impact on the 
total test score. The present study investigated the impact of purposefully weighting a type of 
item in writing tests that contained all three item types. The kind of weighting investigated here 
is intentionally or purposefully imposed by human judgment, as opposed to other 
psychometrically-based weighting schemes (e.g., item pattern scoring based on item response 
theory, weighting based on reliability, weighting by the numbers of items). 

The issue of intentional weighting was investigated in two situations: (a) a single extended 
writing prompt was weighted twice, so that the score points from extended writing were doubled 
(“ER context”), and (b) a set of short constructed-response and extended-response items were 
weighted twice (“CR context”). Take situation (a) as an example. A frequently used, 
“traditional” approach to attaining twice as many score points from extended writing as from a 
single prompt is to actually administer a second comparable prompt. Although this option of 
administering an additional item or items is more desirable from the perspective of the 
generalizability of total test scores, it may not be viable because of the increase in testing time. 

A possible alternative is to administer a single extended-response item and explicitly give it 
twice the weight, so the result would be the same as the conventional approach with regard to the 
sum of score points. Surely, one cannot expect every score, weighted or unweighted, from a 
shorter test with a single set of open-ended items to be very similar to the student’s score from a 
longer test with twice as many open-ended items. However, given the wish to have open-ended 
items contribute more to the total test score, intentional weighting would be a viable alternative if 
it yields student scores that are fairly comparable to those obtained without such weighting when 
both are compared with scores from the conventional approach of “twice as many items.” 

The study compared the three options using live data from a state. The three options are: 

• Option 1 (a longer form with no weighting): requires administering twice as many ER 
and/or CR items, but involves no explicit weighting. This option honors the desire to weight 
an item or items of a selected type(s) by administering more items. 

[denoted in tables and figures “(unweighted) ER2” and “(unweighted) CR2,” respectively, 
for the ER and CR contexts], 

• Option 2 (a shorter form with double-weighting): administers a single ER item or a single 
set of CR and ER items, and gives these items double weight in producing scores. This 
option also honors the wish to have an item or items weighted twice. 

[denoted in tables and figures “weighted ER1” and “weighted CR1,” respectively, for the ER 
and CR contexts], and 

• Option 3 (a shorter form with no weighting): requires the administration of the same 
number of items as option 2, but involves no intentional weighting. Unlike option 1 or 2, this 
option disregards the desire to weight an item or items of a selected type(s). 

[denoted “unweighted ER1” and “unweighted CR1,” respectively, for the ER and CR 
contexts]. 

To reiterate, the target option is option 1, while the option of primary interest is option 2. The 
reason option 3 was added is that the option of interest (option 2) differs from the criterion 
option 1 in two aspects: test length and the use of purposeful weighting. Option 3, which differs 
from option 1 in only one aspect, test length, helps unravel the confounding. Any difference 
observed between option 1 and option 3 represents what is expected from the difference in form 
length. By comparing the option 2 - option 1 difference against the option 3 - option 1 
difference, the effect of intentional weighting could be examined with the impact of test length 
removed. 

The following summarizes all the option-context combinations: 

Option      ER context          CR context 
Option 1    (Unweighted) ER2    (Unweighted) CR2 
Option 2    Weighted ER1        Weighted CR1 
Option 3    Unweighted ER1      Unweighted CR1 

In the study, the three options were compared in terms of raw-score-to-scale-score (RS-SS) 
conversion tables and the resulting scale scores. The testing program from which the tests and 
data came utilizes item response theory (IRT) to obtain the parameter estimates for the items and 
employs the number-correct scoring method in which each possible raw-score total is converted 
to a scale score based on a RS-SS table, which, in turn, is based on the item parameter estimates. 



Method 



Data Source 

Data came from a field-test of Writing items for a state assessment program in grades 3, 5, and 8. 
The field tests contained two ER items and five, nine, or 15 CR items, depending on the grade, as 
well as 38 or 50 SR items. The field-test items were analyzed and calibrated using the IRT 
models described in Appendix A. From the pool of field-tested items, a set of items was selected 
for an operational form in each grade that met the content blueprint and the statistical criteria. 

Test Forms Used in the Study 

Regardless of the grade, the operational form contains a single ER item and was used with 
weighting as the “weighted ER1” form for option 2 and used with no weighting as the 
“unweighted ER1” form for option 3. The “(unweighted) ER2” form for option 1 was created by 
adding the other ER item, which resulted in two ER items. Attempts were made to make the two 
ER items as equally difficult as possible. The ER1 and ER2 forms have the identical set of SR 
and CR items. 

Next, the “CR1” and “CR2” forms were constructed in such a way that the CR2 form has twice 
as many CR and ER items as the CR1 form and yet they are as comparable as possible in test 
difficulty. Because the CR2 form has more CR and ER items, it inevitably differs from the CR1 
form in terms of internal consistency and content coverage. As in the ER context, the 
“(unweighted) CR2” form was used for option 1, whereas the CR1 form was used with 
weighting as the “weighted CR1” form for option 2 and used without weighting as the 
“unweighted CR1” form for option 3. 



All the forms within a grade (i.e., ER1, ER2, CR1, and CR2) have the identical set of SR items. 
Tables 1-3 present lists of items included in each of the forms. The check mark (✓) indicates 
that the item is in the form. The tables also show the number of students in the calibration 
sample, the mean p-values, and Feldt-Raju reliability values. The ER1 and ER2 forms, and the 
CR1 and CR2 forms, are comparable in average test difficulty. 

At the bottom of each table are the mean p-value for the ER and CR items that are in the CR2 
form but not in the CR1 form, and the mean p-value for the ER and CR items that are in the CR1 
form, that is, the ER and CR items that are doubly weighted. These mean p-values are within .02 
of each other, suggesting the approximate equivalence in average difficulty between the 
additional items in the CR2 form and the doubly-weighted “virtual” items in the weighted CR1 
form. 

Table 4 presents further details of the forms. The table shows the numbers of SR, CR, and ER 
items, the total numbers of items and score points in the form, and the form’s coverage of 
content by objective. Note that in the CR1 forms, the number of ER and CR items to be weighted 
twice increases over three grades. The grade 3 CR1 form has three ER/CR items to be weighted; 
the grade 5 CR1 form has 5 such items; the grade 8 CR1 form contains 7 such items. 



Analyses 

As noted before, the field-test items were calibrated to obtain the item parameter estimates. At 
each grade, the item parameter estimates from the field-test calibration were used as the true 
values in the comparisons of the forms. For example, after the CR2 form and the CR1 form were 
constructed from the field-test form, they were not re-calibrated. The same parameter estimates 
from the single field-test calibration were used for the same item in all study forms* in which it 
appeared, whether the item was in the CR1 form or the CR2 form or the ER forms. The 
parameter estimates were initially in the logit-like metric, but were placed onto a scale-score 
(SS) scale with a multiplier of 50 and an additive constant of 500. The lowest and highest scale 
scores of 200 and 800 were imposed on students’ scale scores. 
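
The linear transformation and the truncation described above can be illustrated with a minimal sketch. It assumes that a value in the logit-like metric (an ability or item location) is simply multiplied by 50 and shifted by 500; the function name `to_scale_score` and the sample values are hypothetical and are not taken from the study's software.

```python
def to_scale_score(theta, multiplier=50.0, constant=500.0,
                   lowest=200.0, highest=800.0):
    """Map a logit-metric value to the scale-score metric, then truncate.

    The defaults follow the constants stated in the text (multiplier 50,
    additive constant 500, bounds 200 and 800). The function itself is an
    illustrative assumption, not the program's actual code.
    """
    ss = multiplier * theta + constant
    return max(lowest, min(highest, ss))


# Hypothetical logit-metric values, chosen only to show the truncation.
examples = {theta: to_scale_score(theta) for theta in (-7.0, -1.0, 0.0, 2.5, 6.5)}
print(examples)
```

Values beyond roughly -6 or +6 logits fall outside the 200-800 range and are set to the nearest bound.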

Weighting came into play when RS-SS tables were generated using these parameter estimates. 
All RS-SS tables were produced using CTB’s proprietary software program named “FLUX.” 
For option 1 with the ER2 or CR2 form that was always unweighted, a usual, unweighted RS-SS 
table was generated. For option 2 with weighting, a weighted RS-SS table was generated using 
the ER1 or CR1 form by giving double weight to a single ER item or to all the CR and ER items. 
For option 3 with no weighting, an unweighted RS-SS table was produced using the ER1 or CR1 
form without engaging weighting. Table 5 provides an overview of the forms in terms of which 
items were deliberately weighted and which items were intentionally left unweighted. 

* If the items had been re-calibrated for each study form, the re-calibrated parameter estimates for the items, say, in 
the CR2 form would likely be different from those for the items in the CR1 form. This is because the CR2 form 
contains more CR items than the CR1 form. 

Using the RS-SS tables generated, students in the calibration samples were scored for each of the 
option - context combinations. 
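
One common way to build a number-correct RS-SS table is to invert the (weighted) test characteristic curve: each raw score is assigned the ability at which the expected weighted raw score equals it. The sketch below illustrates that idea only. It assumes dichotomous items under a two-parameter logistic model with made-up parameters; the study's actual models are those in Appendix A, and the proprietary program is not reproduced here. All names and values are hypothetical.

```python
import math


def expected_weighted_score(theta, items, weights):
    """Weighted test characteristic curve at ability theta (logit metric).

    items: list of (a, b) pairs for 0/1 items under an assumed 2PL model.
    weights: one weight per item (2 for a doubly weighted item).
    """
    return sum(w / (1.0 + math.exp(-a * (theta - b)))
               for (a, b), w in zip(items, weights))


def rs_to_ss_table(items, weights, lo=-6.0, hi=6.0):
    """Assign a scale score to every integer raw score from 0 to the maximum."""
    max_rs = int(sum(weights))           # each item is scored 0/1 in this sketch
    table = {0: 200.0, max_rs: 800.0}    # extreme scores take the imposed bounds
    for rs in range(1, max_rs):
        a_lo, a_hi = lo, hi
        for _ in range(60):              # bisection: the curve increases in theta
            mid = (a_lo + a_hi) / 2.0
            if expected_weighted_score(mid, items, weights) < rs:
                a_lo = mid
            else:
                a_hi = mid
        ss = 50.0 * (a_lo + a_hi) / 2.0 + 500.0
        table[rs] = min(800.0, max(200.0, ss))
    return table


# Hypothetical parameters; the last item stands in for the weighted item.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.0, -0.5)]
unweighted = rs_to_ss_table(items, [1, 1, 1, 1])   # analogous to option 3
weighted = rs_to_ss_table(items, [1, 1, 1, 2])     # analogous to option 2
```

In this sketch the only difference between the weighted and unweighted tables is the weight vector; the item parameter estimates are shared, as in the study.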



Results 



The RS-SS tables for the ER context are provided in Tables 6-8. In Table 6, the RS-SS pairs 
for the grade 3 ER forms are listed for options 1-3, from left to right. Table 7 shows the 
comparable RS-SS pairs for the grade 5 ER forms, and Table 8 for the grade 8 ER forms. 

Comparisons of the RS-SS tables and the test characteristic curves (TCCs) 

The RS-SS pairs for the ER context are graphically depicted in Figures 1-3 in terms of test 
characteristic curves (TCCs) with the percent correct, as opposed to the number correct score, on 
the Y axis. Although plots are often a great tool to evaluate differences, they usually do not 
show as much precision as the actual numbers. For example, in Figure 1, the option 1 and option 
2 TCCs are so close, with the maximum SS difference of 3 points, that they are not differentiated 
in the figure. In those cases particularly, the RS-SS tables should be consulted. 

The figures show that although the option 2 TCC is usually closer to the target option 1 TCC, 
both the option 2 and option 3 TCCs are so comparable to the criterion option 1 TCC that it 
seems reasonable to declare that both the weighted form (option 2) and the unweighted short 
form (option 3) are alternate forms of the unweighted long form with additional items (option 1). 
The close comparability of the TCCs among the three options is particularly notable in the SS 
range between 400 and 550. Despite the similarities, the option-3 TCCs are slightly but 
consistently higher than the option-1 TCCs on the low and high ends. 

Comparisons in standard errors (SEs) and SE curves 

In addition to the RS-SS pairs, the RS-SS tables (Tables 6-8) also show the standard errors 
(SEs) for the SSs. These are “constrained” standard errors. They are constrained in such a way 
that the upper or lower bound of the 1 SE band in SS for a given RS is never above the upper or 
lower bound of the 1 SE band in SS for (RS + 1). 
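
The constraint can be read as a monotonicity condition on both edges of the 1 SE band. The sketch below only checks whether a table satisfies that condition; it does not reproduce how the SEs were actually constrained, which the text does not describe. The function name and the sample values are hypothetical.

```python
def bands_are_monotone(scale_scores, standard_errors):
    """Return True if neither edge of the 1 SE band decreases as RS increases.

    Both lists are ordered by raw score. This checks the stated condition
    only; it is not the constraining procedure itself.
    """
    uppers = [ss + se for ss, se in zip(scale_scores, standard_errors)]
    lowers = [ss - se for ss, se in zip(scale_scores, standard_errors)]
    upper_ok = all(u0 <= u1 for u0, u1 in zip(uppers, uppers[1:]))
    lower_ok = all(l0 <= l1 for l0, l1 in zip(lowers, lowers[1:]))
    return upper_ok and lower_ok


# Hypothetical values: upper edges 360, 375, 395 and lower edges 240, 305, 345.
satisfied = bands_are_monotone([300, 340, 370], [60, 35, 25])
# Hypothetical violation: the upper edge drops from 360 to 355.
violated = bands_are_monotone([300, 340], [60, 15])
```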

The SEs are plotted in Figures 4-6. The RS-SS tables and the SE plots show that 

• in all cases, the SE curve for option 1 (ER2), as expected from longer form length, is the 
lowest throughout the SS range, 

• as far as the low end of the SS range below 350 is concerned, the SE curve for option 2 
(weighted ER1) is nearly identical or very similar to the criterion SE curve under option 1 
(ER2). As compared with option 2, the SE curve for option 3 (unweighted ER1) is very 
slightly higher, that is, farther away from the target option-1 curve, 

• for the mid-range of SSs between 350 and 600, the three SE curves for options 1-3 are 
either practically identical or very comparable, 

• on the high end of the SS range above 600, although option 2 and option 3 are very similar in 
terms of SEs, the former has slightly higher SE than the latter. The SE curve for either option 
is substantially higher than the target SE curve for option 1. At a SS of 700, the 
linearly-interpolated SE for either option 2 or option 3 is about 10 to 20 SS points higher than 
that for option 1. 
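
Because a RS-SS table lists SEs only at the scale scores that correspond to attainable raw scores, the SE at a round value such as 700 has to be interpolated between neighboring table entries. A minimal sketch of that linear interpolation follows. The (SS, SE) pairs are hypothetical and are not taken from Tables 6-8, and the function assumes strictly increasing SS values.

```python
def interpolate_se(target_ss, ss_points, se_points):
    """Linearly interpolate the SE at target_ss from tabled (SS, SE) pairs.

    ss_points must be strictly increasing; se_points holds the matching SEs.
    """
    pairs = list(zip(ss_points, se_points))
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if x0 <= target_ss <= x1:
            return y0 + (y1 - y0) * (target_ss - x0) / (x1 - x0)
    raise ValueError("target_ss lies outside the tabled scale-score range")


# Hypothetical table entries bracketing a scale score of 700.
ss_points = [650.0, 688.0, 735.0]
se_points = [30.0, 38.0, 52.0]
se_at_700 = interpolate_se(700.0, ss_points, se_points)
```

Applying the same function to the table for each option and subtracting gives the kind of SE difference at 700 that is reported above.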

The weighted SE curve is lower on the low end of the range, indicating that the ER item being 
weighted twice is located in this very easy range. However, the p-value for the item, .58 in Table 
1, suggests otherwise. The plot in Appendix B shows an unusual-looking TCC for the item, 
depicting a very low location. This phenomenon, which is not limited to this item, seems to be 
caused by the fact that in the item calibrations, the rounded average of six trait scores for the 
same prompt was used as the student’s response to the item. All ER items in the state’s tests 
were calibrated in this manner. Extended-response items are usually difficult. If the ER items 
had been scored in a typical fashion with no averaging of six trait scores, they would likely have 
lowered the SE curves on the high end of the SS range. 

Scale score comparisons 

The comparisons of options 2 and 3 against option 1 with regard to the RS-SS tables and curves 
have demonstrated both marked similarities and noticeable dissimilarities, particularly on the 
extreme ends of the SS range. The crucial question is: how do the similarities and dissimilarities 
translate into differences in students’ scale scores? If the students are located in the SS range 
where the three options are very comparable, differences in the RS-SS tables may be 
inconsequential for most students. 

To answer this question, for each grade, three sets of scale scores (SSs) for the students in the 
scaling sample were compared, that is, option 1 SSs, option 2 SSs, and option 3 SSs. The results 
of the scale-score comparisons in the ER context are summarized in Table 12. 

Regardless of grade, the mean SS for either option 2 or option 3 is extremely similar to the 
criterion mean SS for option 1, indicating that the three options are tau-equivalent. The largest 
SS difference is half a scale-score point for the option 1 - option 3 comparison at grade 8. The 
SS standard deviations are very similar among the three options. 

To compare the three options on the level of the individual student, two difference scores were 
computed for each student, one between the option 2 SS and the option 1 SS and the other 
between the option 3 SS and the option 1 SS. The means and standard deviations of the 
difference scores, presented in Table 12, display the same pattern described above. The table 
also shows the percentages of students with “relatively small” difference scores. Two types of 
“relatively small” difference scores are used: differences equal to or smaller than 5 SS points in 
absolute magnitude, and those equal to or smaller than 10 SS points in absolute magnitude. 

Since the minimum standard errors are in the vicinity of 15 to 20 SS points, the 5-point SS 
difference is roughly a quarter to a third of a minimum SE, while the 10-point SS difference is 
approximately a half to two-thirds of a minimum SE. 
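
The per-student comparison described above can be sketched as follows. The code computes the difference scores against the criterion option, their mean and standard deviation, and the percentage of students within each cut-off. The scale scores are made up for illustration, the function name is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption.

```python
def summarize_differences(option_ss, criterion_ss, cutoffs=(5, 10)):
    """Summarize within-student differences between an option and the criterion."""
    diffs = [o - c for o, c in zip(option_ss, criterion_ss)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample standard deviation (n - 1 denominator) is assumed here.
    sd = (sum((d - mean) ** 2 for d in diffs) / (n - 1)) ** 0.5
    within = {c: 100.0 * sum(abs(d) <= c for d in diffs) / n for c in cutoffs}
    return mean, sd, within


# Hypothetical scale scores for six students (not study data).
option1 = [412, 455, 498, 530, 561, 603]   # criterion: longer, unweighted form
option2 = [409, 461, 497, 542, 559, 598]   # shorter, weighted form
mean_diff, sd_diff, pct_within = summarize_differences(option2, option1)

# The cut-offs expressed as fractions of a minimum SE of 15 or 20 points.
fractions = {se: (5 / se, 10 / se) for se in (15, 20)}
```

Running the same function with option 3 scores in place of option 2 scores yields the second set of difference scores, so the two options can be compared against the same criterion.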

Across three grades, the weighted option-2 SSs are within 5 SS points of the target option-1 SSs 
in 60% - 66% of the students and within 10 SS points in 82% - 86% of the students. The 
unweighted SSs are within 5 SS points of the criterion SSs in 61% - 80% of the students, and 
within 10 SS points in 86% - 89% of the students. Thus, for all three grades, either by the 
5-point or 10-point criterion, the option-3 SSs are similar to the criterion option-1 SSs for more 
students than are the option-2 SSs, except when the 5-point criterion is used at grade 3. 

The pairs of SSs are plotted in Figures 7 through 12 to see how close the SSs are among the three 
options. For example, Figure 7 plots the option 1 SSs against the option 2 SSs for grade 3 
students, and Figure 8 plots the option 1 SSs against the option 3 SSs for the same students. 
Figure 9 is a plot of the option 1 SSs against the option 2 SSs for the grade 5 students, while 
Figure 10 is a plot of the option 1 SSs against the option 3 SSs for the same students. 

In terms of correlations that are included in the plots, the option 3 SSs seem as similar to the 
target option 1 SSs as are the option 2 SSs. One may note while comparing Figures 7 and 8 that 
the scatterplot for the option 3 - option 1 SS pairs is slightly tighter along the approximate 
45-degree line than for the option 2 - option 1 SS pairs, indicating that the option 3 SSs are slightly 
closer to the criterion option 1 SSs than the option 2 SSs are to the criterion SSs. This is also 
observed in the comparisons for the remaining two grades, that is, Figure 9 vs. Figure 10 and 
Figure 11 vs. Figure 12. This characteristic is in line with the greater numbers of relatively 
accurate individual scores observed above for the no-weighting option (option 3), relative to the 
weighting option (option 2). 

The plots also reveal that the option-1 and option-3 SSs are not tau-equivalent at the low (and 
high) ends of the SS range, indicating that the option-3 SSs are, on average, not very similar to 
the option-1 SSs at the extreme sections of the SS range. For example, the option-3 SSs for the 
students in a relatively low range tend to be lower, sometimes considerably lower, than the 
criterion SSs under option 1, but are rarely higher than the target SSs. This is related to the 
earlier observations that the TCCs for option 3 are slightly but consistently higher than those for 
option 1 on the low and high ends of the SS range. This means that the same low percentage of 
maximum possible number-correct score would always lead to a higher SS under option 1 than 
under option 3 at low and high ends. The lack of tau-equivalence on the high end is visible only 
in the grade 3 plots, simply because no students are present in the range at the other two grades. 



CR context 

The RS-SS tables for the CR context are provided in Tables 9-11, respectively, for grades 3, 5, 
and 8. In each table, the RS-SS pairs are listed for options 1 - 3, from left to right. The column 
“Diff SS” is absent from Tables 9 and 11, because the maximum number of raw-score points for 
the CR2 form does not equal twice the number of raw-score points for the CR1 form, which 
necessitated the computation of the percentage of the maximum raw-score points (% NC) for 
each of the CR2, weighted CR1, and unweighted CR1 forms. 
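
Expressing raw scores as a percentage of the maximum puts forms with different numbers of score points on a common footing. The following minimal sketch shows the computation. The maximum point totals are hypothetical and were chosen only so that the longer form's maximum is not the same as the weighted shorter form's maximum.

```python
def percent_of_max(raw_score, max_points):
    """Percentage of the maximum raw-score points (% NC)."""
    return 100.0 * raw_score / max_points


# Hypothetical maximum raw-score points for the three forms (not from Table 4).
max_points = {"CR2": 41, "weighted CR1": 40, "unweighted CR1": 30}

# The same raw score of 20 corresponds to a different % NC on each form.
pct_nc = {form: percent_of_max(20, points) for form, points in max_points.items()}
```
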
Comparisons of the RS-SS tables and the test characteristic curves (TCCs) 

The RS-SS pairs for the CR context are graphically depicted in Figures 13-15 as test 
characteristic curves (TCCs). In Figure 15, the option 1 and option 2 TCCs appear on top of 
each other, although they, in fact, differ by up to 10 SS points as indicated in Table 11. 

As in the ER context, the option 2 TCC is nearly identical or very comparable to the option 1 
TCC. The option 3 TCC is still substantially comparable to the criterion option 1 TCC through 
the range of SSs, although the two are somewhat apart at the lower asymptote in grades 3 and 5, that 
is, in a SS range below 350, and, as seen in the ER context, the option 3 TCC is consistently higher 
than the target TCC in the high and low ends of the SS range. 

Although the number of ER and CR items being doubly weighted increases over three grades (3, 
5, and 7 items for grades 3, 5, and 8), no systematic differences between options 1 and 2 in terms 
of the TCCs are observed over the grades. 

Comparisons in standard errors (SEs) and SE curves 

The constrained standard errors (SEs) for the SSs in Tables 9-11 are plotted in Figures 16-18. 
The RS-SS tables and the SE plots for the CR context manifest patterns similar to those seen in 
the ER context. Namely, 

• in all cases, the SE curve for option 1 (CR2), as expected, is the lowest throughout the SS 
range, 

• as far as the low end of the SS range below 350 is concerned, the SE curve for option 2 
(weighted CR1) is closer to the criterion SE curve for option 1 (unweighted CR2) than is the 
SE curve for option 3 (unweighted CR1). The option 3 SE curve deviates somewhat more 
from the option 1 SE curve in the CR context than in the ER context. Namely, as compared 
with the ER context, the SEs for option 3 in the CR context are higher than those of option 1 
by even more (i.e., by about 15 to 30 SS points) at the lower end, 

• for the mid-range of SSs between 350 and 600, the three SE curves for options 1 - 3 are still 
very comparable. Although the option 1 SE curve in the CR context is discernibly the 
lowest, the differences of SEs between option 1 and option 2 and between option 1 and 
option 3 are very small and largely within 5 SS points, 

• on the high end of the SS range above 600, although the option 2 curve tends to show slightly 
higher SE than the option 3 curve, they are once again very similar in terms of SEs. The SE 
curve for either option is substantially higher than the criterion SE curve for option 1. At a 
SS of 700, the linearly-interpolated SE for either option 2 or option 3 is about 13 to 25 SS 
points higher than that for option 1. 

The locations of the weighted items range between 383 and 473 for grade 3, between 377 and 
525 for grade 5, and between 350 and 506 for grade 8. These locations seem to explain why the weighted 
CR1 SE curves are lower than the unweighted CR1 curves in the lower ends of the SS range. 

Scale score comparisons 

As in the ER context, for each grade, three sets of scale scores (SSs) for the students in the 
scaling sample were compared; that is, option 1 SSs, option 2 SSs, and option 3 SSs. The pairs 
of SSs for the CR context are plotted in Figures 19 through 24. At each grade, the option 1 SSs 
are plotted against option 2 SSs on the left, and then against the option 3 SSs on the right. The 
plots also include the correlations. As in the ER context, the correlations between the option 1 
and option 2 SSs are very similar to those between the option 1 and option 3 SSs. 

Previously in the ER context, it was observed that the option 1 - option 3 SS pairs hugged the 
45-degree line more closely than the option 1 - option 2 SS pairs. This observation is much less 
apparent in the plots for the CR context. However, the lack of tau-equivalence at the low end of 
the SS range for the option 1 - option 3 scatterplots is still visible in the CR context. As noted 
earlier, this is in line with what has been observed in terms of TCC curves. As in the ER context, 
the option 1 - option 2 scatterplots for the CR context appear tau-equivalent. 

The numerical results of the scale-score comparisons in the CR context are summarized in Table 
13. As in the ER context (Table 12), both the option 2 and option 3 SSs are, on average, very 
comparable to the criterion option 1 SSs in all three grades. The differences in the SSs between 
option 1 and option 2 are, on average, within a half SS point, while they are approximately 
within a SS point between option 1 and option 3. The SS standard deviations (SDs) are very 
comparable among the three options, although the option 2 and option 3 SDs are more similar to 
each other than are the option 1 and option 2 SDs, or the option 1 and option 3 SDs. 

The absolute mean difference score for the option 1 - option 3 comparison tends to increase from 
grade 3 to grade 5 to grade 8 (|-.25|; |.54|; |1.12|). This may be caused by the increasing number 
of additional ER and CR items that the CR2 forms contain (3, 5, and 7 items in grades 3, 5, and 
8). That is, the difference in test length and in items between the option 1 (CR2) and option 3 
(CR1) forms increases from the lowest grade to the highest grade. However, Table 12 for the ER 
context displays a similar pattern of quasi-increasing mean difference scores over grades, even 
though the difference between the ER2 and ER1 forms remains constant across grades (i.e., one 
ER item). Therefore, the increase over grades in the absolute mean difference score in the CR 
context may well be a coincidence. 

As before, the percentage of students with relatively small within-student SS differences between 
option 1 and option 2 was compared with that for the option 1 - option 3 comparison. Again, 
“relatively small” was defined using two cut-off differences: 5 SS points or less, and 10 SS 
points or less, in absolute value. Regardless of the grades, the weighting option has produced 
scores that are within 5 SS points in 30% - 35% of the students and within 10 SS points in 53% - 
63%. The option of no weighting has generated scores that are within 5 SS points in 33% - 
42%, and within 10 SS points in 59% - 64% of the students. Irrespective of the cut-off 
differences, the option 3 SSs are, consistently in all three grades, similar to the option 1 SSs in 
1% - 9% more students than are the option 2 SSs. A similar observation was made in the ER 
context. 



Discussion 

The weighting option, option 2, generally has produced percentage-raw-score-to-scale-score 
conversion tables that are more similar to the criterion tables (option 1) than the no-weighting 
option (option 3) throughout the scale-score range from 200 to 800. The option-3 tables slightly 
differ from the target tables on the extreme ends. In terms of measurement error associated with 
scale scores, once again, option 2 has slightly or substantially lower standard errors than option 
3 except on the high end where option 3 has slightly lower errors. Smaller errors for the 
weighted scale scores in the lower range reflect the fact that the items weighted twice happen to 
be located in this range. In a way, the lower SEs for option 2 are expected, since some items are 
considered twice in the computation of the errors even though they were never actually taken 
twice by the students. It would be interesting to see if lower SEs for option 2 could be validated 
in a test-retest or alternate-form study. 

These relative similarities between the criterion option 1 and the weighting option with regard to 
the conversion tables do not translate directly into student scores. The three options are almost 
identical in terms of average scale scores. This is not surprising in light of an unreported 
analysis showing that 92% - 96% of scores are located in a middle range between 350 and 600, 
where the three options are substantially similar. Scatterplots of scale scores have revealed that, 
in terms of marginal mean scale scores, the weighted option-2 mean scores seem to be more 
similar to the criterion mean scores than do the unweighted option-3 mean scores at the low and 
high ends of the SS range. Thus, the weighting option compares very favorably against the 
no-weighting option in terms of both overall and marginal mean scores. 

The comparisons of options 2 and 3 relative to option 1 at the level of individual student scores 
present a slightly different picture. In the ER context, unweighted scores (option 3) are 
within 10 scale-score points of their criterion option-1 scores in 2% - 5% more students than are 
weighted scores (option 2), and within 5 points in 14% - 15% more students in two of the three 
grades. In the CR context, unweighted scores are within 10 points of the target scores in 1% - 
9% more students than are weighted scores, and within 5 points in 3% - 7% more students. Thus, 
the no-weighting option (option 3) tends to approximate the criterion scores more closely 
than does the weighting option (option 2) at the level of individual students. However, if |10| 
scale-score deviations are acceptable, the 2% - 5% increases in less accurate scores with the 
weighting option in the ER context do not appear detrimental. Ten-point differences may be 
tolerated, particularly in a situation where students are classified into categories and their exact 
scores are not as crucial. Unfortunately, the state whose data were used for the study has not 
established performance cut-off scores, and the study could not evaluate the options in terms of 
their effects on student classifications into performance categories. The use of a shorter form 
and of explicit weighting may have an even smaller impact on performance classifications, 
particularly in the ER context. Even the 1% - 9% decreases in score precision in the CR context 
may be within an acceptable threshold under some circumstances. 
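The percentages quoted in this comparison are shares of students whose option-2 or option-3 scale score falls within a given distance of the criterion (option-1) score. A minimal sketch of that calculation follows; the function name and the student scores are hypothetical, and the cut-off is treated as inclusive (a difference of exactly 5 or 10 points counts as "within"), which is an assumption.

```python
def pct_within(scores, criterion, cutoff):
    """Percentage of students with |score - criterion score| <= cutoff."""
    if len(scores) != len(criterion) or not scores:
        raise ValueError("score lists must be non-empty and of equal length")
    hits = sum(1 for s, c in zip(scores, criterion) if abs(s - c) <= cutoff)
    return 100.0 * hits / len(scores)

# Hypothetical scale scores for four students.
option1 = [500, 480, 520, 450]   # criterion (longer form)
option2 = [503, 488, 520, 439]   # weighted shorter form

within_5 = pct_within(option2, option1, 5)    # differences 3, 8, 0, 11
within_10 = pct_within(option2, option1, 10)
```

Running the same function on the option-3 scores and subtracting the two percentages gives the "more students" figures discussed in the text.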

Although the evaluation of scale-score differences used two criteria (i.e., 5 and 10 points in 
absolute magnitude), the |5|-point criterion may be too stringent. For example, differences 
between pairs of scale scores from number-correct (NC) and pattern scoring methods can be 
greater than |5| points. A separate analysis of data from another state, based on two tests 
containing mixed item types, demonstrated that most (98% or 99%) of NC scores were within 
10 points of their corresponding pattern scores, while 87% - 90% of NC scores were within 5 
points. Despite documented increases in score accuracy with pattern scoring (e.g., Yen & 
Candell, 1991), many practitioners opt for NC scoring because it is easier to understand, 
meaning that the decrease in precision associated with NC scoring is well tolerated or even 
accepted. Thus, the |10|-point criterion, as opposed to the |5|-point criterion, may be 
emphasized in assessing options 2 and 3 against option 1. 

The discussion so far has focused on the impact of deliberate weighting. What about the impact 
of test length? The effect of test length on score accuracy can be evaluated by comparing scores 
between option 1 and option 3, which differ in only one feature, test length. As 
expected, the effect of test length is considerably greater in the CR context than in the ER 
context. The difference between the longer and shorter forms in the ER context is a single ER 
item, whereas it is a set of CR and ER items in the CR context. Irrespective of the |5|- or 
|10|-point criterion, the percentages of students with unweighted scores that are relatively 
similar to the criterion option-1 scores range from about 60% to 90% in the ER context. The 
percentages diminish markedly in the CR context, to between about 35% and 65%. Note that these 
appreciable drops in score precision in the CR context are caused by the absence of only a few 
(3 - 7) additional ER and CR items. 

With or without weighting, some practitioners would expect a higher level of score accuracy and 
favor a longer test, while others would tolerate lower precision and accept a shorter test. 
Several factors would affect this decision. First, an obvious factor is the level of stakes 
involved in the decisions to be made based on student scores. Second, one should consider 
whether it is a single score, or a category in which a student is placed, that is crucial. Third, 
as noted in the Introduction, testing time may be a practical consideration. Fourth, as seen in 
Table 1, the tests with a single ER or a single set of CR/ER items with no weighting have 
reliabilities very similar to those with twice as many ER and/or CR items. That is, in terms of 
test reliability, the shorter form should be considered acceptable. Some of the same factors may 
play a role in deciding between number-correct and item-pattern scoring. 

The study has other limitations. The study treated student scores from a longer form as the 
criterion scores. However, these scores may still differ from the true scores, and a simulation is 
called for. Moreover, the study did not address the aspect of content representation of the 
different forms. The form with a single set of ER and CR items with no weighting obviously has 
different content coverage than the forms with twice as many of these items or the “virtual” 
forms when these items are weighted twice. This could be a serious drawback for the no- 
weighting option. Furthermore, the live-data study has a few features that may not be shared by 
other assessments, such as the way the ER items were scored, and the fact that the doubly- 
weighted items were relatively easy. Due to these idiosyncrasies, the results, to some extent, 
may not be generalizable to other tests. 

In conclusion, the desire to explicitly “honor” the efforts taken by students to produce longer 
responses is understandable. This study has found no loss in the precision of average scores, 
both overall and throughout the range of scale scores, due to intentional double-weighting. On 
the level of individual student scores, deliberate weighting, as compared with no weighting, 
results in more students with less accurate scores. The increase in less accurate scores is 
relatively small (5% on average) if differences of up to ten points from the criterion scores from 
a longer test are acceptable. However, it is clear that purposeful weighting can never overcome 
the loss of score accuracy caused by shorter test length.¹ 



¹ Options for future research include (1) a simulation study, and (2) another alternative of summing two ratings for 
each CR item if each response is rated by two raters, thereby doubling the number of score points contributed by 
each CR item, which is a different way of accomplishing the desired double-weighting. 
Appendix A: Item response theory models used in the study 

Because the characteristics of selected-response (SR) and constructed-response (CR) items are 
different, two IRT models were used in item calibration. The three-parameter logistic model 
(3PL) (Lord & Novick, 1968; Lord, 1980) was used in the analysis of SR items. In this model, 
the probability that a student with ability $\theta$ responds correctly to item $i$ is 

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7\, a_i (\theta - b_i)]},$$

where $a_i$ is the item discrimination, $b_i$ is the item difficulty, and $c_i$ is the probability of a 
correct response by a very low-scoring student. 
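The 3PL formula can be evaluated directly. The following is a minimal sketch, not the calibration software used in the study; the function name and the parameter values in the example are hypothetical.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.

    theta: ability; a: discrimination; b: difficulty;
    c: lower asymptote; D: the 1.7 scaling constant in the formula.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# When ability equals the item difficulty, the probability is
# halfway between the lower asymptote c and 1.
p_at_b = p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
```

As theta decreases the probability approaches `c`, which matches the interpretation of the third parameter as the success rate of a very low-scoring student.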



For analysis of the constructed-response items, the two-parameter partial credit model (2PPC) 
(Muraki, 1992; Yen, 1993) was used. The 2PPC model is a special case of Bock’s (1972) 
nominal model. Bock’s model states that the probability of an examinee with ability $\theta$ having a 
score $(k - 1)$ at the $k$-th level of the $j$-th item is 

$$P_{jk}(\theta) = P(x_j = k - 1 \mid \theta) = \frac{\exp Z_{jk}}{\sum_{i=1}^{m_j} \exp Z_{ji}}, \qquad k = 1, \ldots, m_j,$$

where 

$$Z_{jk} = A_{jk}\theta + C_{jk}.$$

The $m_j$ denotes the number of score levels for the $j$-th item, and typically, the highest score level 
is assigned $(m_j - 1)$ score points. For the special case of the 2PPC model used here, the 
following constraints were used: 

$$A_{jk} = \alpha_j (k - 1)$$

and 

$$C_{jk} = -\sum_{i=0}^{k-1} \gamma_{ji}, \qquad \gamma_{j0} = 0,$$

where $\alpha_j$ and $\gamma_{ji}$ are the free parameters to be estimated from the data. Each item has 
$(m_j - 1)$ independent $\gamma_{ji}$ parameters and one $\alpha_j$ parameter; a total of $m_j$ parameters are 
estimated for each item. 
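The 2PPC category probabilities can be sketched in a few lines. This is an illustration under the usual statement of the 2PPC constraints (slope for score $s$ equal to $\alpha_j s$, intercept equal to minus the cumulative sum of the step parameters, with $\gamma_{j0} = 0$); the function names and example values are hypothetical and the code is not PARDUX.

```python
import math

def p_2ppc(theta, alpha, gammas):
    """Return [P(score = 0), ..., P(score = m - 1)] for one 2PPC item.

    gammas holds the m - 1 free step parameters gamma_1 .. gamma_{m-1};
    gamma_0 is fixed at 0 and is not passed in.
    """
    z = []
    cum_gamma = 0.0
    for score in range(len(gammas) + 1):
        if score > 0:
            cum_gamma += gammas[score - 1]
        z.append(alpha * score * theta - cum_gamma)
    z_max = max(z)                      # subtract the max for numerical stability
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def expected_score(theta, alpha, gammas):
    """Expected item score, i.e. the item's contribution to the TCC."""
    return sum(s * p for s, p in enumerate(p_2ppc(theta, alpha, gammas)))

probs = p_2ppc(0.0, 1.0, [0.0, 0.0])    # three score levels, all equally likely
```

With two score levels the model reduces to a two-parameter logistic curve in `alpha * theta - gamma_1`, which is a quick sanity check on the indexing.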

The IRT model parameters were estimated using CTB’s PARDUX software (Burket, 1991). 
PARDUX estimates parameters simultaneously for SR and CR items using marginal maximum 
likelihood procedures implemented via the EM (expectation-maximization) algorithm (Bock & 
Aitkin, 1981; Thissen, 1982). 
Simulation studies have compared PARDUX with MULTILOG (Thissen, 1991), PARSCALE 
(Muraki & Bock, 1991), and BIGSTEPS (Wright & Linacre, 1992). PARSCALE, MULTILOG, 
and BIGSTEPS are among the most widely known and used IRT programs. PARDUX was 
found to perform at least as well as these other programs. 
Table 1. 

Grade 3 Mean P-Values and Reliabilities (N = 2,574) 

Item Type   Item #   P-value 
SR             1      .520 
               2      .640 
               3      .780 
               4      .619 
               5      .371 
               6      .638 
               7      .805 
               8      .787 
               9      .723 
              10      .768 
              11      .817 
              12      .788 
              13      .804 
              14      .533 
              15      .636 
              16      .655 
              17      .760 
              18      .669 
              19      .689 
              20      .749 
              21      .712 
              22      .785 
              23      .850 
              24      .840 
              25      .408 
              26      .564 
              27      .838 
              28      .785 
              29      .924 
CR             1      .681 
               2      .606 
               3      .781 
               4      .751 
               5      .740 
ER             1      .598 
               2      .576 

All SR items are included in all four forms (ER2, ER1, CR2, and CR1). 

                           ER2     ER1     CR2     CR1 
Mean p-value              .699    .702    .699    .701 
Feldt-Raju reliability    .911    .907    .917    .905 

Mean p of ER/CRs that are in CR2 but not in CR1 = .68 
Mean p of ER/CRs that are doubly weighted = .65 
Table 2. 

Grade 5 Mean P-Values and Reliabilities (N = 2,642) 

Item Type   Item #   P-value    ER2    ER1    CR2    CR1 
SR             1      .500       ✓      ✓      ✓      ✓ 
               2      .785       ✓      ✓      ✓      ✓ 
               3      .613       ✓      ✓      ✓      ✓ 
               4      .761       ✓      ✓      ✓      ✓ 
               5      .697       ✓      ✓      ✓      ✓ 
               6      .568       ✓      ✓      ✓      ✓ 
               7      .311       ✓      ✓      ✓      ✓ 
               8      .265       ✓      ✓      ✓      ✓ 
               9      .453       ✓      ✓      ✓      ✓ 
              10      .783       ✓      ✓      ✓      ✓ 
              11      .776       ✓      ✓      ✓      ✓ 
              12      .798       ✓      ✓      ✓      ✓ 
              13      .485       ✓      ✓      ✓      ✓ 
              14      .741       ✓      ✓      ✓      ✓ 
              15      .690       ✓      ✓      ✓      ✓ 
              16      .757       ✓      ✓      ✓      ✓ 
              17      .282       ✓      ✓      ✓      ✓ 
              18      .902       ✓      ✓      ✓      ✓ 
              19      .766       ✓      ✓      ✓      ✓ 
              20      .575       ✓      ✓      ✓      ✓ 
              21      .594       ✓      ✓      ✓      ✓ 
              22      .488       ✓      ✓      ✓      ✓ 
CR             1      .712       ✓      ✓      ✓      ✓ 
               2      .684       ✓      ✓      ✓      ✓ 
               3      .702       ✓      ✓      ✓ 
               4      .738       ✓      ✓      ✓ 
               5      .627       ✓      ✓      ✓ 
               6      .439       ✓      ✓      ✓      ✓ 
               7      .559       ✓      ✓      ✓ 
               8      .841                     ✓      ✓ 
ER             1      .606       ✓      ✓      ✓      ✓ 
               2      .556       ✓             ✓ 

Mean p-value                    .620   .622   .627   .625 
Feldt-Raju reliability          .879   .873   .885   .848 

Mean p of ER/CRs that are in CR2 but not in CR1 = .64 
Mean p of ER/CRs that are doubly weighted = .66 
Table 3. 

Grade 8 Mean P-Values and Reliabilities (N = 3,633) 

Item Type   Item #   P-value 
SR             1      .811 
               2      .789 
               3      .624 
               4      .823 
               5      .836 
               6      .622 
               7      .835 
               8      .595 
               9      .679 
              10      .668 
              11      .826 
              12      .370 
              13      .378 
              14      .692 
              15      .612 
              16      .433 
              17      .731 
              18      .927 
              19      .317 
              20      .722 
              21      .517 
              22      .652 
              23      .451 
              24      .461 
              25      .439 
              26      .763 
CR             1      .599 
               2      .707 
               3      .283 
               4      .375 
               5      .495 
               6      .818 
               7      .927 
               8      .940 
               9      .741 
              10      .456 
              11      .560 
              12      .378 
              13      .753 
ER             1      .657 
               2      .640 

All SR items are included in all four forms (ER2, ER1, CR2, and CR1). 

                           ER2     ER1     CR2     CR1 
Mean p-value              .622    .621    .640    .637 
Feldt-Raju reliability    .889    .883    .914    .889 

Mean p of ER/CRs that are in CR2 but not in CR1 = .66 
Mean p of ER/CRs that are doubly weighted = .64 
Table 4. 

Numbers of SR, CR, ER Items and Score Points, and Objective Coverage 

                                          Form 
                               ER2    ER1    CR2    CR1 
Grade 3 
  # SR items                    29     29     29     29 
  # CR items                     3      3      4      2 
    # 2-pt items                 3      3      3      2 
    # 3-pt items                               1 
  # ER items (5 pts each)        2      1      2      1 
  Total # items                 34     33     35     32 
  Total # score points          45     40     48     38 
  # of CR / ER items in: 
    Obj. 3                       2      1      2      1 
    Obj. 4                       1      1      2      1 
    Obj. 5                       1      1 
    Obj. 6                       1      1      2      1 

Grade 5 
  # SR items                    22     22     22     22 
  # CR items (2 pts each)        7      7      8      4 
  # ER items (5 pts each)        2      1      2      1 
  Total # items                 31     30     32     27 
  Total # score points          46     41     48     35 
  # of CR / ER items in: 
    Obj. 2                       4      4      4      2 
    Obj. 5                       2      1      2      1 
    Obj. 6                       3      3      4      2 

Grade 8 
  # SR items                    26     26     26     26 
  # CR items                     6      6     12      6 
    # 2-pt items                 6      6     11      5 
    # 4-pt items                               1      1 
  # ER items (5 pts each)        2      1      2      1 
  Total # items                 34     33     40     33 
  Total # score points          48     43     62     45 
  # of CR / ER items in: 
    Obj. 2                       2      2      4      2 
    Obj. 3                       2      2      5      2 
    Obj. 4                       1      1      2      1 
    Obj. 5                       1      1      1      1 
    Obj. 6                       2      1      2      1 
Table 5. 

Option - Context Combinations and Where Intentional (Un)Weighting was Used 

ER Context 

            Option 1        Option 2             Option 3 
            (ER2)           (Weighted ER1)       (Unweighted ER1) 
Grade 3     2 ERs           1 ER weighted        1 ER unweighted 
            3 CRs           3 CRs                3 CRs 
            29 MCs          29 MCs               29 MCs 
Grade 5     2 ERs           1 ER weighted        1 ER unweighted 
            7 CRs           7 CRs                7 CRs 
            22 MCs          22 MCs               22 MCs 
Grade 8     2 ERs           1 ER weighted        1 ER unweighted 
            6 CRs           6 CRs                6 CRs 
            26 MCs          26 MCs               26 MCs 

CR Context 

            Option 1        Option 2             Option 3 
            (CR2)           (Weighted CR1)       (Unweighted CR1) 
Grade 3     2 ERs           1 ER weighted        1 ER unweighted 
            4 CRs           2 CRs weighted       2 CRs unweighted 
            29 MCs          29 MCs               29 MCs 
Grade 5     2 ERs           1 ER weighted        1 ER unweighted 
            8 CRs           4 CRs weighted       4 CRs unweighted 
            22 MCs          22 MCs               22 MCs 
Grade 8     2 ERs           1 ER weighted        1 ER unweighted 
            12 CRs          6 CRs weighted       6 CRs unweighted 
            26 MCs          26 MCs               26 MCs 
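The score-point totals in Table 4 and the weighting scheme in Table 5 can be checked with simple arithmetic. The sketch below uses the Grade 3 counts; it assumes each SR item is worth one point (which is consistent with the Table 4 totals), and the function name is hypothetical.

```python
def total_points(n_sr, cr_points, er_points, weight_cr=1, weight_er=1):
    """Maximum raw score of a form.

    n_sr: number of 1-point SR items (assumed point value);
    cr_points / er_points: maximum points of each CR / ER item;
    weight_cr / weight_er: multiplier applied to CR / ER item scores.
    """
    return n_sr + weight_cr * sum(cr_points) + weight_er * sum(er_points)

# Grade 3, ER context
er2 = total_points(29, [2, 2, 2], [5, 5])                   # option 1
er1_weighted = total_points(29, [2, 2, 2], [5], weight_er=2)  # option 2
er1_unweighted = total_points(29, [2, 2, 2], [5])           # option 3

# Grade 3, CR context (CR2 includes one 3-point CR item)
cr2 = total_points(29, [2, 2, 2, 3], [5, 5])                # option 1
cr1_weighted = total_points(29, [2, 2], [5], weight_cr=2, weight_er=2)  # option 2
cr1_unweighted = total_points(29, [2, 2], [5])              # option 3
```

In the ER context the weighted shorter form has the same maximum raw score as the longer form, whereas in the Grade 3 CR context the two differ by one point because the longer form contains a 3-point CR item that the shorter form lacks.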
Table 6. 

Grade 3 RS-to-SS Tables: Op. 1 (ER2), Op. 2 (Weighted ER1), and Op. 3 (Unweighted ER1) 

Table 7. 

Grade 5 RS-to-SS Tables: Op. 1 (ER2), Op. 2 (Weighted ER1), and Op. 3 (Unweighted ER1) 

Table 8. 

Grade 8 RS-to-SS Tables: Op. 1 (ER2), Op. 2 (Weighted ER1), and Op. 3 (Unweighted ER1) 

Table 9. 

Grade 3 RS-to-SS Tables: Op. 1 (CR2), Op. 2 (Weighted CR1), and Op. 3 (Unweighted CR1) 

Table 11. 

Grade 8 RS-to-SS Tables: Op. 1 (CR2), Op. 2 (Weighted CR1), and Op. 3 (Unweighted CR1) 
Table 12. 

Scale-Score Comparisons of the Three Options: ER Context 

Grade 3: ER forms 

Scale scores:                   Mean      SD 
Option 1: ER2                  497.5    66.6 
Option 2: Weighted ER1         497.4    68.6 
Option 3: Unweighted ER1       497.3    69.3 

                                                 % students        % students 
Difference scores:              Mean      SD     w/ diff. ≤ |5|    w/ diff. ≤ |10| 
Option 2 SS - Option 1 SS      -0.06   13.10          66%               82% 
Option 3 SS - Option 1 SS      -0.20   12.07          61%               86% 

Grade 5: ER forms 

Scale scores:                   Mean      SD 
Option 1: ER2                  499.7    59.5 
Option 2: Weighted ER1         499.4    58.7 
Option 3: Unweighted ER1       499.9    61.0 

                                                 % students        % students 
Difference scores:              Mean      SD     w/ diff. ≤ |5|    w/ diff. ≤ |10| 
Option 2 SS - Option 1 SS      -0.30    9.26          66%               84% 
Option 3 SS - Option 1 SS       0.16    8.01          80%               89% 

Grade 8: ER forms 

Scale scores:                   Mean      SD 
Option 1: ER2                  488.2    68.6 
Option 2: Weighted ER1         488.6    69.1 
Option 3: Unweighted ER1       488.7    70.4 

                                                 % students        % students 
Difference scores:              Mean      SD     w/ diff. ≤ |5|    w/ diff. ≤ |10| 
Option 2 SS - Option 1 SS       0.43   10.32          60%               86% 
Option 3 SS - Option 1 SS       0.51   10.11          75%               88% 
Table 13. 

Scale-Score Comparisons of the Three Options: CR Context 

Grade 3: CR forms 

Scale scores:                   Mean      SD 
Option 1: CR2                  501.2    64.4 
Option 2: Weighted CR1         500.9    68.2 
Option 3: Unweighted CR1       501.0    68.5 

                                                 % students        % students 
Difference scores:              Mean      SD     w/ diff. ≤ |5|    w/ diff. ≤ |10| 
Option 2 SS - Option 1 SS      -0.28   17.94          35%               63% 
Option 3 SS - Option 1 SS      -0.25   17.16          42%               64% 

Grade 5: CR forms 

Scale scores:                   Mean      SD 
Option 1: CR2                  499.5    59.7 
Option 2: Weighted CR1         499.7    62.7 
Option 3: Unweighted CR1       500.0    62.1 

                                                 % students        % students 
Difference scores:              Mean      SD     w/ diff. ≤ |5|    w/ diff. ≤ |10| 
Option 2 SS - Option 1 SS       0.25   17.13          30%               53% 
Option 3 SS - Option 1 SS       0.54   15.71          33%               59% 

Grade 8: CR forms 

Scale scores:                   Mean      SD 
Option 1: CR2                  488.7    65.3 
Option 2: Weighted CR1         489.3    67.9 
Option 3: Unweighted CR1       489.9    67.6 

                                                 % students        % students 
Difference scores:              Mean      SD     w/ diff. ≤ |5|    w/ diff. ≤ |10| 
Option 2 SS - Option 1 SS       0.54   14.61          34%               54% 
Option 3 SS - Option 1 SS       1.12   13.69          37%               63% 



Figure 1. 

Grade 3 TCCs: Op. 1 (ER2), Op. 2 (Weighted ER1), & Op. 3 (Unweighted ER1) 

Figure 11. 

Grade 8: Scale-score plot of Op. 1 (ER2) SSs and Op. 2 (weighted ER1) SSs 

Figure 12. 

Grade 8: Scale-score plot of Op. 1 (ER2) SSs and Op. 3 (unweighted ER1) SSs 

Figure 23. 

Grade 8: Scale-score plot of Op. 1 (CR2) SSs and Op. 2 (weighted CR1) SSs 

Figure 24. 

Grade 8: Scale-score plot of Op. 1 (CR2) SSs and Op. 3 (unweighted CR1) SSs 
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Total Number of Items 


Reporting 
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Grade 2 
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Grade 5 


Grade 6 


Grade 7 
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Editing: 
Capitalization 
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Spelling 


11 


9 


2 


4 


5 


5 


5 




12 


16 


4 




23 


6 






13 








25 








16 
















17 














Language Structure 


4 


3 


1 


1 


1 


1 


1 


(Syntactic) 


16 


4 


3 


3 


2 


2 


2 




17 


6 




4 


3 


3 


4 






8 






5 


6 


7 


Meaning 


5 


5 


5 


5 


3 


2 


2 


(Semantic) 


6 


6 


6 


6 


6 


6 


5 




7 


15 


7 


7 


9 


7 


6 




8 


18 


8 


8 


10 


9 


7 




10 


20 


15 


9 


18 


10 


9 




16 


21 


16 


10 


19 


11 


10 




17 




17 


11 


20 


12 


11 




20 




18 


14 


22 


13 


12 










15 


23 


14 


13 










16 


24 


15 


14 










17 


25 


17 


16 










18 


26 


18 


17 










19 


27 


19 


19 










20 


28 


20 


20 














21 


22 














22 


23 














23 
















24 
















25 
MISSISSIPPI GRADE LEVEL TESTING PROGRAM 
CRT FALL 2000 MATHEMATICS BLUE PRINTS 

Mathematics: Total Number of Items 

Reporting Categories           Grade 2   Grade 3   Grade 4   Grade 5   Grade 6   Grade 7   Grade 8 

Patterns, Algebraic Thinking   7 MC      7 MC      8 MC      6 MC      7 MC      9 MC      9 MC 
                               1 OE      1 OE      1 OE                1 OE      1 OE      1 OE 

Data Analysis, Prediction      7 MC      7 MC      7 MC      7 MC      7 MC      7 MC      7 MC 
                               1 OE      1 OE      1 OE      1 OE      1 OE      1 OE      1 OE 

Measurement                    9 MC      9 MC      8 MC      8 MC      8 MC      5 MC      5 MC 
                               1 OE      1 OE      1 OE      1 OE      1 OE      1 OE      1 OE 

Geometric Concepts             6 MC      6 MC      7 MC      9 MC      8 MC      9 MC      9 MC 
                               1 OE                1 OE      1 OE      1 OE      1 OE      1 OE 

Number Sense                   21 MC     21 MC     23 MC     20 MC     20 MC     20 MC     20 MC 
                               1 OE      2 OE      1 OE      2 OE      1 OE      1 OE      1 OE 
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Mathematics 



Reporting 

Categories 



Patterns, 

Algebraic 

Thinking 



Benchmarks/Items 



Grade 2 

5b 

1b 

6b 

6j 



Grade 3 

1a 

1b 

1c 

8a 

8b 

8c 



Grade 4 Grade 5 Grade 6 



la 

1b 

1c 



5j 



la 

1b 

1c 

Id 

1e 

If 



Grade 7 



7a 

7b 

7c 

7d 

7e 



Grade 8 



3a 

3b 

3c 

3d 

3e 

3f 

3g 

3h 



8a 

8b 

8c 

8d 



3a 

3b 



3c 

3d 

3e 

3f 



3g 



Data Analysis 
Predication 



3a 

3b 

3c 

3d 



4a 

4b 

4c 

4d 

4e 



4a 

4b 

4c 



3a 

3b 

3c 

3d 



5a 

5b 

5c 

5d 

5e 

5f 



4a 

4b 

4c 

4d 

4e 

4f 



7a 

7b 

7c 

7d 

7e 

7f 



4g 

4h 



7g 






Measurement 


2a 


3a 


3a 


2a 


4a 


3a 


5a 




2b 


3b 


3b 


2b 


4b 


3b 


5b 




2c 


3c 


3c 


2c 


4c 


3c 


5c 




2d 


3d 


3d 


2d 


4d 




5d 




2e 


3e 


3e 


2e 


4e 








4a 


3f 




2f 


4f 








4b 


3g 




2g 










4c 


5a 














4d 


5b 














4e 


5c 














5a 


5d 














5b 


5e 














5c 
















5d 
















5e 
















5f 
















5g 















Mathematics 



Benchmarks/Items 



Reporting 

Categories 


Grade 2 


Grade 3 


Grade 4 


Grade 5 


Grade 6 


Grade 7 


Grade 8 


Geometric 


1a 


2a 


2a 


la 


2a 


5a 


6a 


Concepts 


1b 


2b 


2b 


1b 


2b 


5b 


6b 


1c 


2c 


2c 


1c 


2c 


5c 


6c 




Id 


2d 


2d 


Id 


2d 


5d 


6d 




1e 


2e 


2e 


1e 


2e 


5e 


6e 








2f 


If 


3a 


5f 


6f 








2g 


ig 


3b 


5g 


6g 








2h 




3c 


5h 


6h 












3d 


5I 


6i 














5j 
















5k 
















5I 









Number Sense 


6a 


6a 


5a 


4a 


6a 


la 


la 




6b 


6b 


5b 


4b 


6b 


1b 


1b 




6c 


6c 


5c 


4c 


6c 


1c 


1c 




6d 


6d 


6a 


4d 


6d 


Id 


Id 




6e 


6e 


6b 


4e 


6e 


1e 


1e 




6f 


6f 


6c 


4f 


6f 


If 


If 




6g 


6g 


6d 


4g 


6g 


1 g 


ig 




6h 


6h 


6e 


5a 


7a 


2a 


2a 




61 


61 


6f 


5b 


7b 


2b 


2b 




6j 


6j 


7a 


5c 


7c 


2c 


2c 




6k 


6k 


7b 


5d 


7d 


2d 


2d 




6i 


6i 


7c 


5e 


8a 


2e 


2e 




7a 


6m 


7d 


5f 


8b 


2f 


2f 




7b 


7a 


7e 


5g 


8c 


2g 


4a 




7c 


7b 


7f 


5h 


8d 


6a 


4b 




7d 


7c 


7g 


51 


8e 


6b 


4c 




7e 


7d 


7h 


5j 


9a 


6c 


4d 




8a 


7e 


7i 


5k 


9b 


6d 


4e 




8b 


7f 






9c 


8a 


4f 




8c 


7g 






9d 


8b 


4g 




8d 


7h 






9e 


8c 








71 






9f 


8d 








7j 






9g 


8e 








7k 






10a 


8f 








71 






10b 


8g 








7m 






10c 


8h 








7n 






lOd 


8I 








9a 






lOe 


8j 








9b 
















9c 
















9d 
















9e 
MISSISSIPPI GRADE LEVEL TESTING PROGRAM 
CRT FALL 2000 READING BLUE PRINTS 

Reading: Total Number of Items 

Reporting Categories           Grade 2    Grade 3    Grade 4    Grade 5    Grade 6    Grade 7    Grade 8 

Context Cues                   5-7 MC     6 MC       7-8 MC     8-9 MC     6 MC       5-6 MC     5-6 MC 
(Semantic)                     0 CR       0 CR       0 CR       0 CR       0 CR       0 CR       0 CR 

Language Structure             5-7 MC     6 MC       5-8 MC     5-6 MC     6 MC       6 MC       6-8 MC 
(Syntactic)                    0-1 CR     0 CR       0 CR       0 CR       0 CR       0 CR       0 CR 

Word Patterns                  5-6 MC     6 MC       Not        Not        Not        Not        Not 
(Phonetic Structure)           0 CR       0 CR       Assessed   Assessed   Assessed   Assessed   Assessed 

Vocabulary                     7-10 MC    6 MC       8-9 MC     5-6 MC     6-7 MC     6 MC       6 MC 
                               0 CR       0 CR       0 CR       0 CR       0 CR       0 CR       0 CR 

Main Idea and Details          11-13 MC   11-15 MC   8-10 MC    12-16 MC   12-15 MC   9-12 MC    10-11 MC 
(Textual)                      2-3 CR     2-3 CR     2-3 CR     2-3 CR     1-4 CR     2-3 CR     2 CR 

Extended Meaning/Thinking      9-15 MC    11-15 MC   13-15 MC   10-14 MC   12-14 MC   13-15 MC   10-16 MC 
(Metacognitive)                2-3 CR     2-3 CR     1-2 CR     2 CR       10-2 CR    1-3 CR     1-2 CR 

Workplace Data                 Not        Not        4 MC       4-5 MC     4-7 MC     6-8 MC     5-7 MC 
(Evaluative)                   Assessed   Assessed   1-2 CR     0-1 CR     0-4 CR     0-1 CR     1-2 CR 
Reading 

Benchmarks/Items 


Reporting 


Grade 2 


Grade 3 


Grade 4 


Grade 5 


Grade 6 


Grade 7 


Grade 8 


Categories 


















4 


4 


7 


4 


5 




6 


Context Cues 


8 


11 


8 


8 


9 


6 


7 


(Semantic) 


16 


14 


13 


11 


14 




8 




21 




15 








20 




22 


11 


5 


5 


5 




5 


Language 


13 


15 


7 


8 


7 


5 


8 


Structure 




10 


9 


9 


10 






(Syntactic) 






12 












7 


10 


5 


5 


5 


5 


5 


Word Patterns 


14 


11 


10 


10 


6 


7 


8 


(Phonetic 


15 




11 


12 


7 


8 


3 


Structure) 


6 








11 




10 




8 


8 


7 


6 


8 




5 


Vocabulary 


21 


15 


9 


8 


9 


Writing 


6 




22 




40 


12 


11 


#7 


7 










15 


14 




20 




1 


4 


16 25 


16 23 


15 23 


22 21 


12 21 


Main Idea and 


4 


12 


17 29 


18 25 


14 24 


22 22 


14 22 


Details (Textual) 


18 


14 


19 19 


20 29 


14 30 


22 23 


16 23 




20 


17 


21 41 


21 34 


14 33 


22 31 


17 24 




21 


19 


22 42 


22 36 


22 


20 


18 33 




24 


20 


23 








20 




4 


4 


14 22 


22 30 38 


22 27 


11 23 


13 24 29 


Extended 


18 


12 


32 


22 31 


22 28 


13 25 


15 26 30 


Meaning/Thinking 


20 


16 


15 28 


22 32 


22 29 


16 26 


20 27 32 


(Metacognitive) 


23 


18 


34 


17 33 


22 30 


18 27 


22 28 




25 


19 


16 29 


22 34 


26 32 


22 28 






26 




35 


22 37 


34 












18 30 
















36 
















20 31 












2 


3 


16 38 


16 42 


14 38 


14 37 


12 37 


Workplace Data 


3 


20 


43 


18 43 


14 39 


15 38 


19 38 


(Evaluative) 






17 39 


27 44 


14 40 


20 39 


21 39 








25 40 


40 


14 41 


35 40 


35 








26 41 


41 


37 


36 41 


36 








37 42 
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